๐Ÿผ Pandas Tutorial

Master Data Analysis with Python


Introduction to Pandas

Pandas is a powerful, open-source data analysis and manipulation library for Python. It provides data structures and functions needed to work with structured data seamlessly. Built on top of NumPy, pandas is the cornerstone of data analysis in Python.

📊

Data Structures

Series and DataFrame for efficient data handling

🔄

Data Manipulation

Powerful tools for reshaping and pivoting data

📁

I/O Support

Read/write data from CSV, Excel, SQL, JSON, and more

🧹

Data Cleaning

Handle missing data and duplicates efficiently

⚡

Performance

Fast operations on large datasets

📈

Analysis Tools

Statistical functions and aggregations

Installation & Setup

Install Pandas

# Install via pip
pip install pandas

# Install with NumPy and other dependencies
pip install pandas numpy matplotlib

# Install with conda
conda install pandas

Import Pandas

import pandas as pd
import numpy as np

# Check version
print(pd.__version__)
💡 Convention: pandas is imported as pd by convention; this is the standard alias used throughout the data science community.

Pandas Series

A Series is a one-dimensional labeled array capable of holding any data type. Think of it as a single column in a spreadsheet, or one column of a DataFrame.

Creating Series

# From a list
s = pd.Series([1, 3, 5, 7, 9])
print(s)

# With custom index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# From a dictionary
data = {'a': 10, 'b': 20, 'c': 30}
s = pd.Series(data)

# From NumPy array
s = pd.Series(np.random.randn(5))

Series Operations

s = pd.Series([1, 2, 3, 4, 5])

# Accessing elements
print(s[0])     # First element
print(s[1:4])   # Slicing

# Arithmetic operations
print(s + 10)   # Add 10 to all elements
print(s * 2)    # Multiply all by 2

# Statistical operations
print(s.mean())  # Mean
print(s.sum())   # Sum
print(s.max())   # Maximum
print(s.std())   # Standard deviation
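One subtlety worth seeing once: arithmetic between two Series aligns on index labels, not on position. A minimal sketch (the labels and values here are invented for illustration):

```python
import pandas as pd

# Two Series with overlapping, but not identical, indexes
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Addition matches labels: 'y' -> 2 + 10, 'z' -> 3 + 20.
# Labels present in only one Series ('x' and 'w') come back as NaN.
result = a + b
print(result)
```

This automatic alignment is what makes combining data from different sources safe, but it also means unmatched labels silently become NaN.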

Pandas DataFrames

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. Think of it as a spreadsheet or SQL table.

Creating DataFrames

# From a dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['New York', 'Paris', 'London']
}
df = pd.DataFrame(data)

# From a list of dictionaries
data = [
    {'name': 'Alice', 'age': 25},
    {'name': 'Bob', 'age': 30}
]
df = pd.DataFrame(data)

# From NumPy array
df = pd.DataFrame(
    np.random.randn(4, 3),
    columns=['A', 'B', 'C']
)

DataFrame Attributes & Inspection

# View first/last rows
df.head()    # First 5 rows
df.tail(3)   # Last 3 rows

# Basic information
df.shape     # (rows, columns)
df.columns   # Column names
df.index     # Row indices
df.dtypes    # Data types of columns

# Summary statistics
df.info()      # Detailed info
df.describe()  # Statistical summary

Reading and Writing Data

Reading Data

# Read CSV file
df = pd.read_csv('data.csv')

# Read with specific options
df = pd.read_csv('data.csv',
                 sep=',',
                 header=0,
                 index_col=0,
                 parse_dates=['date_column'])

# Read Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Read JSON
df = pd.read_json('data.json')

# Read from SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table_name', conn)

# Read HTML tables
dfs = pd.read_html('https://example.com/table.html')

# Read clipboard
df = pd.read_clipboard()

Writing Data

# Write to CSV
df.to_csv('output.csv', index=False)

# Write to Excel
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)

# Write to JSON
df.to_json('output.json', orient='records')

# Write to SQL
df.to_sql('table_name', conn, if_exists='replace')

# Write to HTML
df.to_html('output.html')

Data Selection & Indexing

Column Selection

# Select single column
df['column_name']

# Select multiple columns
df[['col1', 'col2']]

# Using dot notation (only works if the name is a valid
# Python identifier and doesn't clash with a DataFrame method)
df.column_name

Row Selection

# Select by label (loc)
df.loc[0]           # Single row
df.loc[0:3]         # Multiple rows (inclusive)
df.loc[0, 'name']   # Specific cell

# Select by position (iloc)
df.iloc[0]          # First row
df.iloc[0:3]        # First 3 rows
df.iloc[0, 1]       # Row 0, Column 1

# Boolean indexing
df[df['age'] > 25]
df[(df['age'] > 25) & (df['city'] == 'Paris')]
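The difference between loc (labels) and iloc (positions) only becomes visible when the index is not the default 0, 1, 2, ... A small sketch with an invented non-default index:

```python
import pandas as pd

# DataFrame whose index labels are 10, 20, 30 rather than 0, 1, 2
df = pd.DataFrame({'score': [85, 90, 95]}, index=[10, 20, 30])

# loc looks up the index label
print(df.loc[10, 'score'])   # the row labeled 10

# iloc looks up the integer position, regardless of labels
print(df.iloc[0]['score'])   # position 0, which happens to be label 10

# loc slicing includes the end label: rows labeled 10 and 20
print(df.loc[10:20])

# iloc slicing excludes the end position, like Python lists: positions 0 and 1
print(df.iloc[0:2])
```

Mixing the two up on a non-default index either raises a KeyError or, worse, silently returns the wrong row, so it pays to be deliberate about which one you mean.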

Conditional Selection

# Query method
df.query('age > 25 and city == "Paris"')

# isin method
df[df['city'].isin(['Paris', 'London'])]

# String contains
df[df['name'].str.contains('Alice')]

# Between values
df[df['age'].between(25, 35)]

Data Cleaning

Handling Missing Data

# Check for missing values
df.isnull()        # Returns boolean DataFrame
df.isnull().sum()  # Count missing per column
df.notnull()       # Opposite of isnull

# Drop missing values
df.dropna()
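Dropping rows is not the only option: missing values can also be filled in, and the duplicates mentioned earlier can be removed. A short sketch with invented data:

```python
import pandas as pd
import numpy as np

# Invented example data with a missing age and a duplicated row
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Bob', 'Charlie'],
    'age': [25, np.nan, np.nan, 35],
})

# Fill missing values with a constant...
filled = df.fillna({'age': 0})

# ...or with a computed value, such as the column mean
mean_filled = df.fillna({'age': df['age'].mean()})

# Remove exact duplicate rows, keeping the first occurrence
deduped = df.drop_duplicates()
print(deduped)
```

Filling with a statistic like the mean keeps the row (and the rest of its columns) available for analysis, at the cost of introducing an imputed value; which trade-off is right depends on the dataset.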